MC 2 - Viz Assignment

Using visualisation techniques in R to investigate the disappearance of employees at GAStech

Manmit
07-16-2021

1. Overview

This assignment is part of VAST Challenge 2021 - Mini Challenge 2 (“MC 2”). The challenge scenario is given as follows:

“In January, 2014, the leaders of GAStech are celebrating their new-found fortune as a result of the initial public offering of their very successful company. In the midst of this celebration, several employees of GAStech go missing. An organization known as the Protectors of Kronos (POK) is suspected in the disappearance, but things may not be what they seem.”

As part of Mini-Challenge 2, visual analytics was performed on i) employee movement using vehicle GPS data; ii) credit card transactions; and iii) loyalty card usage data. From this, visual analysis identified anomalies and suspicious behaviours. This data will be used by law-enforcement agents to get to the bottom of the case!

The visual analysis aims to provide answers to the following questions posed by MC-2: -

The assignment report was written using R Markdown in Distill format. The report is published on Netlify. The data is available on GitHub.

2. Literature Review

The literature review looks at 3 approaches used to solve MC-2 in 2014. Specifically, a review was done on each data visualisation approach, and gaps were identified for i) alternative visual analytics that could be used to enhance the user experience; and ii) R packages that could replicate useful visuals.

  a. Entry - CSU-Zhao-MC2, VAST Challenge 2014, Mini-Challenge 2

Figure 2.1: From Top Left: Combined Visualisation for Analysis, RadViz and Parallel Coordinates and Matrix Visualisation

Zhao et al. used D3, MySQL and Excel to produce the visuals for analysis. The team approached the challenge by i) segmenting car-tracking data and tracking start and end points using temporal colour coding on the map; ii) utilizing consumption data to locate precise positions; and iii) creating a map by merging multiple layers – a tourist map with road-network, location and legend layers.

For visual analysis, the team used i) the RadViz tool; ii) the PMViz tool; and iii) the SGGViz tool. RadViz was used to cluster consumer places and employees with common consumption habits. PMViz was used for the Parallel Coordinates and Matrix Visualisation, identifying when people were together by time and place and fleshing out events when running through a timeseries. SGGViz was used to produce a scatter view and a Gantt Chart & GIS view. The visualisations were combined and cross-referenced with interactivity to explore and analyse the data.

The visualisations used are diverse and aid the understanding of behaviours. Specifically, RadViz with clustering identifies employees’ consumption habit trends, while the Parallel Coordinates and Matrix identify shared behaviours. The map traces activity and travel routes. These visualisations flesh out the given data neatly. Lastly, the Gantt chart places the visuals in a time-series. Notably, to build the map, the team used 4 layers (legend, location, road-network and tourist map). This was useful when mapping the GPS data to get a sense of the routes taken and locations visited.

Reflecting on their approach, it would be useful to construct the tourist map with the various layers. A search recommended using the raster, sf and tmap packages in R to accomplish this. For timeseries data, Gantt charts and the timetk package will be explored. Aside, instead of using RadViz in R, the visNetwork package could be useful to build nodes and edges to understand the connections between locations and credit card usage. visNetwork has interactivity for the user to explore underlying relationships. Another approach could be to use a correlation matrix to relate the employees to each other.
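As a minimal sketch of the visNetwork idea, the snippet below builds a small card-to-location network. The card numbers, locations and visit counts here are purely hypothetical placeholders, not results from the MC-2 data:

```r
library(visNetwork)

#Hypothetical nodes: credit cards and locations (ids, labels and groups are made up)
nodes <- data.frame(
  id    = c("cc_6901", "cc_8332", "Katerina's Cafe", "Abila Scrapyard"),
  label = c("cc 6901", "cc 8332", "Katerina's Cafe", "Abila Scrapyard"),
  group = c("card", "card", "location", "location")
)

#Edges: a card transacted at a location; width could encode visit count
edges <- data.frame(
  from  = c("cc_6901", "cc_6901", "cc_8332"),
  to    = c("Katerina's Cafe", "Abila Scrapyard", "Katerina's Cafe"),
  width = c(5, 1, 3)
)

#Interactive network with nearest-neighbour highlighting and a node selector
visNetwork(nodes, edges) %>%
  visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE)
```

With the real data, the edge list would be derived by counting transactions per (card, location) pair.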

  b. Entry - Purdue-Guo-MC2, VAST Challenge 2014, Mini-Challenge 2

Figure 2.2: Visual Analysis Interactive Visual

Guo et al. used a Dodeca-Rings Map together with a temporal chart and a social relationship matrix. The team used HTML5, CSS, SVG, JavaScript, jQuery, PHP, MySQL and D3 to query, compute, serve and visualize data. The visual is interactive and helps the user understand the activity in a time-series along with the relationships between the employees. The dodecagons flesh out the events on the map. All the visuals are linked for the user to differentiate events.

Figure 2.3: Tooltip for Dodecagons

The dodecagons also have a tooltip feature which goes into deeper detail. However, a lot of detail is shown within a tooltip; it could have been more useful to showcase the details in another window below the dodeca-rings map for exploration. In addition, while the dodeca-rings are neat, persons unfamiliar with dodecagons may find them challenging to interpret.

Figure 2.4: Temporal Chart and Social Relationship Matrix

The temporal chart and social relationship matrix are useful for understanding how persons relate to one another and how the cars travel in a timeseries. There are also options to filter within the temporal chart that make the interaction easy and allow the user to dig deeper.

The linking used by the team is particularly attractive and makes the data easy to interpret and visualize. Similar to Zhao et al. in (a), a time-series was used along with mapping of GPS data on the map. Instead of RadViz, the team used a correlation matrix. Reflecting on how the correlation matrix looks, it could be more useful to leverage a network graph instead; that could make it easier to spot clusters and relationships between people and locations. Furthermore, the interactivity enables additional investigation of the data. Aside, detail to understand movement on the map is useful. In R, a possibility could be to use data tables that allow filtering of data. In addition, the crosstalk package could link charts to the data tables.

  c. Entry - FIAIS-Andrienko-MC2, VAST Challenge 2014, Mini-Challenge 2

Andrienko et al. used V-Analytics to develop the visual. The motivation for their visual seems to be leveraging queries to explore the data.

Figure 2.5: Query Done on Visual

While this is great for exploration, it is not very beneficial when the user shows this to the rest of the investigation team, as the data is not forthcoming and the team will have to continuously explore. This is good for deep investigation, but not as beneficial for visual analytics to identify underlying trends, given the continuous need to query. A good mix would be i) a visualisation that already fleshes out certain abnormalities; and ii) exploration functions for each visual for the user to explore second-level details. Users unfamiliar with V-Analytics may be overwhelmed by the options and the many legends and visuals used. For example, the screenshot below of geospatial data has a lot going on and is not guided enough for the user to follow. As a recommendation, the map layout could be used given that it is intuitive to many, along with filter options for the vehicles or persons driving.

Figure 2.6: Using Geospatial Data

Lastly, the tabular representation of data is not a good idea for visualisation, as users will find it difficult to read across columns and rows. The colours are useful, but the data could be better represented using a temporal chart or Gantt chart.

Figure 2.7: Data Visualisation Using a Table

To bridge the visual analytics gap, it could be useful to use tmap in R to plot individual routes taken and understand each user’s travel patterns. In addition, crosstalk could be used to link visuals to tables, in effect creating a query function. It would be useful for the crosstalk data tables to have a filter function.
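A minimal sketch of this crosstalk linking idea, with a hypothetical transaction data frame (the locations, prices and dates below are invented for illustration):

```r
library(crosstalk)
library(plotly)
library(DT)

#Hypothetical transaction data
df <- data.frame(
  location = c("Katerina's Cafe", "Abila Scrapyard", "Guy's Gyros"),
  price    = c(11.34, 2500, 9.60),
  date     = as.Date(c("2014-01-06", "2014-01-07", "2014-01-08"))
)

#One SharedData object links every widget below
shared <- SharedData$new(df)

bscols(
  filter_select("loc", "Location", shared, ~location),  #query-like filter control
  plot_ly(shared, x = ~date, y = ~price,
          type = "scatter", mode = "markers"),          #linked chart
  datatable(shared, filter = "top")                     #linked data table
)
```

Selecting a location in the filter, or brushing the chart, filters the linked table without any server-side querying.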

  d. Takeaways

Key takeaways from the above review include:-

3. Approach and Packages

In this assignment, both static and interactive visuals were created to explore and derive insights from the data. Specifically, bar charts, box-plots, heatmaps, visual mapping and network visuals were used.

The following packages were installed for the assignment: -

The data sets will be tackled individually to identify trends, suspicious and erratic behavior, and discover relationships. Thereafter, visualisations will be pieced together to address MC-2 questions.

4. Data Preparation

  a. Install and Load Packages

The following is done in RStudio using R Markdown.
#Install and Launch R Packages

packages = c('igraph', 'visNetwork', 'lubridate', 'DT', 'ggiraph', 'plotly',
             'tidyverse', 'raster', 'sf', 'tmap', 'clock', 'rgdal', 'dplyr',
             'hms', 'crosstalk', 'timetk', 'tidygraph', 'ggraph', 'ggrepel',
             'htmltools', 'sugrrants')

#ggplot2 is loaded as part of tidyverse, but is attached explicitly here
library(ggplot2)

#Install any missing packages, then attach them all
for (p in packages){
  if (!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}
  b. Importing Data

The files provided (car-assignments, cc_data, loyalty_data and gps) are imported. Two other files were created in Excel for data exploration, to understand whether the format was useful for creating visuals. These files are All_Combined2 and All_Combined4. Specifically, All_Combined2 merges the cc and loyalty card data in the following format: “LoyaltyDate, LoyaltyLocation, LoyaltyPrice, LoyaltyNumber, CreditCardTimestamp, Location, CreditCardPrice, CardNumber, CreditCardDate”. All_Combined4 merges the cc_data and loyalty_data in the following table form: “Date, Location, Price, CC.Loyalty, CC.Loyalty.Type”.

#Importing the provided and combined data sets
car_assign<- read_csv("data/car-assignments.csv")
credit_card<- read_csv("data/cc_data.csv")
gps <- read_csv("data/gps.csv")
loyalty <- read_csv("data/loyalty_data.csv")
combined_trans <-read_csv("data/All_Combined2.csv")
combined_trans_pivot <- read_csv("data/All_Combined4.csv")
  c. Data Cleaning

The steps below flesh out the data cleaning process.
For each data set, glimpse() was used to identify the format of the data. Thereafter, the following was performed:-

The cleaned data sets are fleshed below the code chunk.

#Clean Data

#1. For Car_Assign, Truck Drivers have no Car Id. Change 'NA' to '0' and CarID used as_factor
car_assign[is.na(car_assign)] = 0
car_assign$CarID <- as_factor(car_assign$CarID)
car_assign_clean <- car_assign


#2. Credit Card Timestamp data needs to be in the proper attribute
credit_card$timestamp <- date_time_parse(credit_card$timestamp,
                                 zone = "",
                                 format = "%m/%d/%Y %H:%M")
credit_card$last4ccnum <- as_factor(credit_card$last4ccnum)
credit_card_clean <- credit_card #save data set as credit_card_clean

#The credit card timestamp data can be further split 
credit_card_split <- credit_card_clean %>%
  mutate(year = format(timestamp,"%m/%d/%Y"),
         time = format(timestamp,"%H:%M:%S"))
credit_card_split$year <- date_parse(credit_card_split$year,
                                format = "%m/%d/%Y")
credit_card_split$time <- as_hms(credit_card_split$time) #as.hms() is deprecated in the hms package

#3. GPS Timestamp needs to be in the proper attribute
gps$Timestamp <- date_time_parse(gps$Timestamp,
                                 zone = "",
                                 format = "%m/%d/%Y %H:%M:%S")
gps$id <- as_factor(gps$id) #id should be a factor and not continuous numerical format
gps_clean <- gps #save data set as gps_clean

#4. Loyalty Card Data - timestamp needs to be in the proper attribute and loyalty number as factor
loyalty$timestamp <- date_parse(loyalty$timestamp,
                                format = "%m/%d/%Y")
loyalty$loyaltynum <- as_factor(loyalty$loyaltynum)
loyalty_clean <- loyalty #save data set as loyalty_clean

#5. For exploration - combined loyalty and credit card data.
#Sort out the date and time formats
combined_trans$LoyaltyDate <- date_parse(combined_trans$LoyaltyDate,
                                         format = "%m/%d/%Y")
combined_trans$CreditCardDate <- date_parse(combined_trans$CreditCardDate,
                                         format = "%m/%d/%Y")
combined_trans$CreditCardTimestamp <- date_time_parse(combined_trans$CreditCardTimestamp,
                                 zone = "",
                                 format = "%m/%d/%Y %H:%M")

#Identifiers should be in fct format
combined_trans$LoyaltyNumber <- as_factor(combined_trans$LoyaltyNumber)
combined_trans$CardNumber <- as_factor(combined_trans$CardNumber)



#6. Pivoted data
combined_trans_pivot$Date <- date_parse(combined_trans_pivot$Date,
                                         format = "%m/%d/%Y")
combined_trans_pivot$CC.Loyalty <- as_factor(combined_trans_pivot$CC.Loyalty)
combined_trans_pivot$Location <- as_factor(combined_trans_pivot$Location)


#7. Looking at the cleaned data sets
glimpse(loyalty_clean)
Rows: 1,392
Columns: 4
$ timestamp  <date> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ location   <chr> "Brew've Been Served", "Brew've Been Served", "Ha~
$ price      <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15.~
$ loyaltynum <fct> L2247, L9406, L8328, L6417, L1107, L4034, L6110, ~
glimpse(credit_card_clean)
Rows: 1,490
Columns: 4
$ timestamp  <dttm> 2014-01-06 07:28:00, 2014-01-06 07:34:00, 2014-0~
$ location   <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price      <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <fct> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
glimpse(gps_clean)
Rows: 685,169
Columns: 4
$ Timestamp <dttm> 2014-01-06 06:28:01, 2014-01-06 06:28:01, 2014-01~
$ id        <fct> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat       <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long      <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~
glimpse(combined_trans)
Rows: 1,490
Columns: 9
$ LoyaltyDate         <date> 2014-01-06, 2014-01-06, 2014-01-06, 201~
$ LoyaltyLocation     <chr> "Brew've Been Served", "Brew've Been Ser~
$ LoyaltyPrice        <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 1~
$ LoyaltyNumber       <fct> L2247, L9406, L8328, L6417, L1107, L4034~
$ CreditCardTimestamp <dttm> 2014-01-06 07:28:00, 2014-01-06 07:34:0~
$ Location            <chr> "Brew've Been Served", "Hallowed Grounds~
$ CreditCardPrice     <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 2~
$ CardNumber          <fct> 4795, 7108, 6816, 9617, 7384, 5368, 7253~
$ CreditCardDate      <date> 2014-01-06, 2014-01-06, 2014-01-06, 201~
glimpse(combined_trans_pivot)
Rows: 2,882
Columns: 4
$ Date       <date> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ Location   <fct> Brew've Been Served, Brew've Been Served, Hallowe~
$ Price      <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15.~
$ CC.Loyalty <fct> Loyalty, Loyalty, Loyalty, Loyalty, Loyalty, Loya~
  d. Merging Data Sets

The cc and loyalty data was merged and imported earlier; as such, it will not be done in R. However, R is capable of doing this by using a right join and pivoting the data table. Please refer to this link for more information.
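A minimal sketch of that right-join-and-pivot approach, using tiny hypothetical stand-ins for cc_data and loyalty_data (the rows and column names below are invented; the real data would be joined on whatever fields identify a shared transaction):

```r
library(dplyr)
library(tidyr)

#Hypothetical mini versions of cc_data and loyalty_data
cc <- tibble(
  date       = as.Date(c("2014-01-06", "2014-01-06")),
  location   = c("Katerina's Cafe", "Hippokampos"),
  price      = c(11.34, 9.60),
  last4ccnum = c("4795", "7108")
)
loyalty <- tibble(
  date       = as.Date(c("2014-01-06", "2014-01-06")),
  location   = c("Katerina's Cafe", "Hippokampos"),
  price      = c(11.34, 9.60),
  loyaltynum = c("L2247", "L9406")
)

#Right join keeps every loyalty record and matches cc records where possible
merged <- right_join(cc, loyalty, by = c("date", "location", "price"))

#Pivot both identifier columns into one, mirroring All_Combined4's
#CC.Loyalty / CC.Loyalty.Type columns
long <- merged %>%
  pivot_longer(c(last4ccnum, loyaltynum),
               names_to  = "CC.Loyalty.Type",
               values_to = "CC.Loyalty")
```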

Next, given that CarID is tagged to employee name, GPS data is merged with Car Assignment data as follows:

#Merge GPS with car assignment. Merge on CarId

Employee_travel <- merge(gps_clean, car_assign_clean, by.x="id", by.y="CarID")
names(Employee_travel)[names(Employee_travel) == "id"] <- "CarID"
Employee_travel$NameTitle <- paste(Employee_travel$Name, Employee_travel$CurrentEmploymentTitle)
glimpse(Employee_travel)
Rows: 613,077
Columns: 10
$ CarID                  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ Timestamp              <dttm> 2014-01-09 17:50:36, 2014-01-07 18:5~
$ lat                    <dbl> 36.05902, 36.06416, 36.06071, 36.0725~
$ long                   <dbl> 24.88247, 24.87959, 24.88340, 24.8666~
$ LastName               <chr> "Calixto", "Calixto", "Calixto", "Cal~
$ FirstName              <chr> "Nils", "Nils", "Nils", "Nils", "Nils~
$ Name                   <chr> "Nils Calixto", "Nils Calixto", "Nils~
$ CurrentEmploymentType  <chr> "Information Technology", "Informatio~
$ CurrentEmploymentTitle <chr> "IT Helpdesk", "IT Helpdesk", "IT Hel~
$ NameTitle              <chr> "Nils Calixto IT Helpdesk", "Nils Cal~

Time to get a sense of what was happening!

5. Exploring Credit Card data

  a. Popular Places by Credit Card (“cc”) Data

Using credit card usage count and location, a bar chart is plotted to identify locations where credit cards were more frequently used.

#This is the start of data exploration
#Exploring the Credit Card Data

#1. Most Popular Places - barchart

credit_card_clean %>%
  mutate(location = fct_infreq(location)) %>%
  ggplot(aes(x = location)) +
  geom_bar()+
  xlab("Location") +
  ylab("Visits Count") +
  theme(axis.text.x = element_text(vjust = 0.5, hjust=1))+
  labs(title = "Most frequented places according to credit card data")+
  coord_flip()

Katerina Cafe, Hippokampos, Brew’ve Been Served and Guy’s Gyros were most frequented by the employees according to credit card usage. Katerina Cafe, Brew’ve Been Served and Guy’s Gyros are likely F&B outlets, but it is unclear at this juncture what Hippokampos is. Given that the Kronos Kares benefit card gives GAStech employees discounts and rewards, we could expect the loyalty card data to show a similar trend in visits. Aside, Daily Dealz, U-Pump and Frank’s Fuel are places where credit cards were least used, and assumed least frequented at this juncture.

  b. Most Used Credit Cards

Plotting the credit card number (last 4 digits) by count, we examine the most used credit cards.
#2. Most Used Credit Cards 
credit_card_clean %>%
  mutate(last4ccnum = fct_infreq(last4ccnum)) %>%
  ggplot(aes(x = last4ccnum)) +
  geom_bar()+
  xlab("Last 4 digits of Credit Card") +
  ylab("Usage Count") +
  theme(axis.text.x = element_text(vjust = 0.5, hjust=1))+
  labs(title = "Credit cards that were most used")+
  coord_flip()

The bar chart highlights cc ending with 6901, 8332 and 3484 as the most used. While 9614, 9152, and 5010 were the least used. The difference between the most and least used credit cards is ~3x.

  c. Outliers in Credit Card Use by Spend and Location

Outliers could highlight unusual credit card spend. The price (y-axis) is plotted against the cc numbers (x-axis). To get a sense of each location’s outliers, the boxplots are faceted by location.
#3 Outliers for Credit Card expenditure

#static check
check_outliers <- credit_card_clean %>%
    ggplot(aes(x=last4ccnum, y=price)) +
    geom_boxplot() +
    facet_wrap(~location, ncol = 5, scales = "free") +
    theme_minimal() +
    theme(panel.border = element_rect(colour="grey60", fill = NA),
          axis.text.x.bottom = element_blank())
check_outliers

Some interesting information is beginning to show!

The outliers can be further explored using the ggplotly() wrapper, which gives the boxplots interactivity.

#dynamic check
check_outliers_dyn <- credit_card_split %>%
    ggplot(aes(x=last4ccnum, y=price)) +
    geom_boxplot() +
    geom_point(position="jitter", size = 0.5) +
    facet_wrap(~location, ncol = 2, scales = "free")

ggplotly(check_outliers_dyn)
  d. Understanding the Spending Patterns

A heatmap of the expenditure is plotted along with a data table. The heatmap plots cc number (x-axis) against location (y-axis), coloured by price. This provides a quick visual of the outlier prices by cc number and place. Alongside, a data table is provided to gather second-level details.

#4. HeatMap of Expenditure of Credit Card Holder at Various Locations
p_hm2 <- plot_ly(data=credit_card_split,
        x= ~last4ccnum,
        y= ~location,
        color=~price) %>%
  layout(title = 'Heatmap of expenditure of credit card holder at various locations')
p_hm2
DT::datatable(credit_card_split,
              filter = 'top') %>%
  formatStyle(0,
              target = 'row',
              lineHeight = '60%')

From the visual above, the following cc numbers were investigated.

Travelling from the refinery to Stewart and Sons Fabrication once a week. The employee has interesting expenditure patterns, spending $4513 at the refinery and then $1738 at Stewart and Son’s. The person then spends $1903 and $4545 at the Refinery and Stewart’s respectively the next week. The expenditure only occurs in the morning and the employee does not spend anywhere else within Kronos.

The employee logged expenditure in the ‘000s at Carlyle Chemical, sometimes on the same day, other times on consecutive days. The employee also spends ‘000s at the refinery. His expenditure at the Refinery is usually between 11am and 12pm. At Carlyle Chemical, he spends at least once in the morning, between 9:30 and 10am. The expenditures in the afternoon vary between 2 and 5pm. Lastly, he goes to Katerina’s before going to Carlyle’s.

The employee spends regularly at the cafes and seems to eat out a lot, being at places like Katerina’s, Guy’s and Brew’ve Been Served during breakfast, lunch and dinner timings. From this we can infer that Ouzeri Eilan could be an F&B outlet as well.

The employee spent $150 at the Kronos Mart at 3:48am. This is a strange time to spend that much at a mart.

Filtering the data table for the suspicious locations identified earlier:

8642 went to the airport between the 6th -9th and 13th – 16th, spending twice each day between 12-1pm and 3-5pm.

Abila Scrapyard was visited by 2276 in both weeks, on Tuesday and Thursday, always in the afternoon. Zooming into 2276’s expenditure:

There is a trend in the places visited. The card holder starts spending on Tuesdays and Thursdays by going to the Airport -> Scrapyard -> Stewart and Sons -> Refinery. Ooo, I wonder what could be going on here!

Another contentious location: spending at this place by employees mainly happens after 8pm! Notably, there are some employees who spend during the afternoon.

Two employees with cc numbers 2540 and 9683 spend within almost half an hour of each other. An outlier, 5010, spent on 18th Jan at 12pm, a day after 9683 and 2540.

The majority of the cc spending was after 7pm, some as late as 9:20pm. Notably, some spent at 3pm.

  e. Overall View of Spending Trends

Using a Gantt-style chart, plotting Location (y-axis) against timestamp (x-axis), overall patterns of spending by time can be noticed.
#6. Credit Card Amount spent at Location 
ts_hm_cc2 <- credit_card_split %>%
 filter(timestamp >= "2014-01-05 15:28:00" & timestamp <= "2014-01-19 04:51:00") %>%
 ggplot() +
  aes(
    x = timestamp,
    y = location,
    fill = price,
    colour = price,
    tooltip = last4ccnum
  ) +
  geom_tile(size = 1.2) +
  scale_fill_distiller(palette = "Set1", direction = 1) +
  scale_color_distiller(palette = "Set1", direction = 1) +
  labs(
    x = "Timestamp",
    y = "Location",
    title = "Heatmap - Amount spent at location"
  ) +
  theme_minimal()


ggplotly(ts_hm_cc2)
DT::datatable(credit_card_split ,
              filter = 'top') %>%
  formatStyle(0,
              target = 'row',
              lineHeight = '60%')

- Stewart and Sons Fabrication: consistent spending with larger amounts over the first week, but the spend value tapers off during the second week.

  f. Spending Patterns by Each cc Owner by Location

Here we explore how each cc owner spends at each location. Is the cc owner consistent in spending, or are there peaks and troughs? Price (y-axis) is plotted against credit card date (x-axis) in a line graph, faceted by location.

#6.1
ts_hm_cc3 <- combined_trans %>%
 filter(!is.na(LoyaltyDate)) %>%
 filter(!is.na(LoyaltyLocation)) %>%
 filter(!is.na(LoyaltyPrice)) %>%
 filter(CreditCardTimestamp >= "2014-01-05 15:28:00" & CreditCardTimestamp <= "2014-01-19 04:51:00") %>%
 ggplot() +
  aes(
    x = CreditCardDate,
    y = CreditCardPrice,
    colour = CardNumber
  ) +
  geom_line(size = 0.5) +
  scale_color_hue(direction = 1) +
  labs(
    x = "Date",
    y = "Price",
    title = "Line Chart Credit Card Trend Expenditure"
  ) +
  theme_minimal() +
  facet_wrap(vars(Location))

ggplotly(ts_hm_cc3)
DT::datatable(combined_trans ,
              filter = 'top') %>%
  formatStyle(0,
              target = 'row',
              lineHeight = '60%')

Putting the spend behaviours together, some erratic card behaviour is observed; the detailed observations have been stated above.

#6.2

ts_hm_cc4 <- combined_trans %>%
 filter(!is.na(LoyaltyDate)) %>%
 filter(!is.na(LoyaltyLocation)) %>%
 filter(!is.na(LoyaltyPrice)) %>%
 filter(CreditCardTimestamp >= "2014-01-05 15:28:00" & CreditCardTimestamp <= "2014-01-19 04:51:00") %>%
 ggplot() +
  aes(
    x = CreditCardDate,
    y = CreditCardPrice,
    colour = Location
  ) +
  geom_line(size = 0.5) +
  scale_color_hue(direction = 1) +
  labs(
    x = "Date",
    y = "Price",
    title = "By Location - Credit Card Trend Expenditure"
  ) +
  theme_minimal()
ggplotly(ts_hm_cc4)

To find out more about the relationships, the following chart can be explored.

#6.3

ts_hm_cc5 <- combined_trans %>%
 filter(CreditCardTimestamp >= "2014-01-05 15:28:00" & CreditCardTimestamp <= "2014-01-19 04:51:00") %>%
 ggplot() +
  aes(x = CreditCardDate, y = CardNumber, fill = Location) +
  geom_tile(size = 1.2) +
  scale_fill_hue(direction = 1) +
  labs(
    x = "Date",
    y = "Credit Card Number",
    title = "Credit Card Number by Location and Day"
  ) +
  theme_minimal()
ggplotly(ts_hm_cc5)

6. Time to get a sense of Loyalty Card Data!

  a. Most Frequented Places According to Loyalty Data

Plotting a bar chart with location against visit count.
#Exploring Loyalty Data

#7. Most Popular Places - barchart

loyalty_clean %>%
  mutate(location = fct_infreq(location)) %>%
  ggplot(aes(x = location)) +
  geom_bar()+
  xlab("Location") +
  ylab("Visits Count") +
  theme(axis.text.x = element_text(vjust = 0.5, hjust=1))+
  labs(title = "Most frequented places according to loyalty data")+
  coord_flip()

The bar chart shows that Katerina Cafe, Hippokampos and Guy’s Gyros were the top 3 places where the loyalty card was used. This is similar to the credit card data, although the usage counts differ. The least frequented places are U-Pump, Frank’s Fuel and Octavo’s Office Supplies. This differs slightly from the credit card data, where Daily Dealz was one of the places with the lowest credit card usage. Noticeably, Daily Dealz was not a location where the loyalty card was used at all.

  b. Most Active Users of the Loyalty Card

The usage count is plotted against the loyalty number in a bar chart to understand which cards were most actively used.

#8. Most Used Loyalty Cards
loyalty_clean %>%
  mutate(loyaltynum = fct_infreq(loyaltynum )) %>%
  ggplot(aes(x = loyaltynum )) +
  geom_bar()+
  xlab("Loyalty Number") +
  ylab("Usage Count") +
  theme(axis.text.x = element_text(vjust = 0.5, hjust=1))+
  labs(title = "Loyalty cards that were most used")+
  coord_flip()

The most active loyalty cards were L6267, L2490, L2070, and L3366. The least active were L5924, L5485, L2459, and L8477.

  c. Outliers for Loyalty Cards

Plotting a boxplot of price against loyaltynum, faceted by location, gives an indication of outlier data points. From here, we can see whether patterns are similar across the credit card and loyalty card data. In addition, for glaring expenditures, it may be possible to match a cc number with a loyalty card.

#9 Outliers for Loyalty expenditure

#static check
check_outliers_lylty <- loyalty_clean %>%
    ggplot(aes(x=loyaltynum, y=price)) +
    geom_boxplot() +
    facet_wrap(~location, ncol = 5, scales = "free") +
    theme_minimal() +
    theme(panel.border = element_rect(colour="grey60", fill = NA),
          axis.text.x.bottom = element_blank())
check_outliers_lylty

The trends point to the loyalty card being used by employees at F&B outlets such as Guy’s Gyros and Katerina’s Cafe. Outliers reflect a trend similar to that seen in the credit card data. For glaring outliers such as the spend at Abila Scrapyard, where only one person spent on particular days, we can connect with some confidence a particular loyalty card to the credit card holder. For more popular locations, matching the data based on price and day is more challenging, as loyalty cards could change hands when groups of people visit, so that certain individuals benefit from loyalty card usage. This will be investigated further.
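One way this matching could be attempted in R is sketched below. It assumes the credit_card_split and loyalty_clean data frames prepared earlier, and treats a shared location, price and calendar day as the same transaction recorded on both cards (a heuristic, not a guarantee):

```r
library(dplyr)

#Candidate cc-loyalty pairings: same location, price and calendar day.
#credit_card_split carries the transaction date in 'year'; loyalty_clean's
#'timestamp' is already a date.
candidate_pairs <- credit_card_split %>%
  inner_join(loyalty_clean,
             by = c("location", "price", "year" = "timestamp")) %>%
  count(last4ccnum, loyaltynum, sort = TRUE)

#Cards that pair repeatedly with the same loyalty number are likely owners;
#one-off matches at busy outlets may be coincidences
head(candidate_pairs)
```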

  d. Using the Heatmap to Identify Loyalty Card Holder Patterns by Location and Price

Similar to the credit card heatmap, loyaltynum is plotted against location.

#10. HeatMap of Expenditure of Loyalty Holder at Various Locations
p_hm_loyal <- plot_ly(data=loyalty_clean,
        x= ~loyaltynum,
        y= ~location,
        color=~price) %>%
  layout(title = 'Heatmap of expenditure of loyalty card holder at various locations')
  
p_hm_loyal
DT::datatable(loyalty_clean,
              filter = 'top') %>%
  formatStyle(0,
              target = 'row',
              lineHeight = '60%')

Some interesting points to note!

Loyalty Card Data table:

Credit Card Data table:

The card holder went to the Airport. Observing the data tables, the patterns are similar to the credit card spending patterns: the card recorded visits to the airport between the 6th - 9th and 13th – 16th, spending twice each day at times. However, upon closer inspection, when comparing the loyalty and credit card data tables, the amounts do not match. On 6th Jan, the loyalty card spend was $612.47 but the credit card spend was $1,873.

This loyalty card number was picked out reflecting on the contentious activity at the airport. Investigating this number shows that the loyalty card has been to places that were highlighted as contentious and where suspicious behaviour perhaps took place. This includes Carlyle Chemical, Abila Airport, Kronos Pipe and Irrigation, and Maxi Iron and Steel. We will keep an eye on you, L4063!

Loyalty Card data

Comparing this to the credit card data: the data tables highlight the credit card expenditure. Given the matching prices, cc 9220 seems to own L4063, cc 7792 is linked to L5756, and cc 4530 is linked to L8477. However, 7792 spent $4803 on the 7th afternoon but did not use the loyalty card. We are suspicious of this!

  e. Spending Patterns

Plotting price (y-axis) against date (x-axis), coloured by loyalty number and faceted by location, line charts can be used to identify peaks and troughs in expenditure with loyalty cards. Let’s see what we find!

#12.1
ts_hm_loyalty3 <-combined_trans %>%
 filter(!is.na(LoyaltyLocation)) %>%
 filter(!is.na(LoyaltyPrice)) %>%
 filter(CreditCardTimestamp >= 
 "2014-01-05 15:28:00" & CreditCardTimestamp <= "2014-01-19 04:51:00") %>%
 ggplot() +
  aes(
    x = LoyaltyDate,
    y = LoyaltyPrice,
    colour = LoyaltyNumber
  ) +
  geom_line(size = 0.5) +
  scale_color_hue(direction = 1) +
  labs(
    x = "Date",
    y = "Price",
    title = "Loyalty Card Spending Patterns by Location and Loyalty"
  ) +
  theme_minimal() +
  facet_wrap(vars(LoyaltyLocation))

ggplotly(ts_hm_loyalty3)

Further trends can be explored using the following:

#12.2
ts_hm_loyalty4 <- combined_trans %>%
 filter(!is.na(LoyaltyDate)) %>%
 filter(!is.na(LoyaltyLocation)) %>%
 filter(!is.na(LoyaltyPrice)) %>%
 filter(CreditCardTimestamp >= "2014-01-05 15:28:00" & CreditCardTimestamp <= "2014-01-19 04:51:00") %>%
 ggplot() +
  aes(
    x = LoyaltyDate,
    y = LoyaltyLocation,
    fill = LoyaltyNumber
  ) +
  geom_tile(size = 1.2) +
  scale_fill_hue(direction = 1) +
  labs(title = "Loyalty Number by Date by Location") +
  theme_minimal()

ggplotly(ts_hm_loyalty4)
#12.3
ts_hm_loyalty5 <- combined_trans %>%
 filter(!is.na(LoyaltyDate)) %>%
 filter(!is.na(LoyaltyLocation)) %>%
 filter(!is.na(LoyaltyPrice)) %>%
 filter(CreditCardTimestamp >= "2014-01-05 15:28:00" & CreditCardTimestamp <= "2014-01-19 04:51:00") %>%
 ggplot() +
  aes(x = CreditCardDate, y = CardNumber, fill = Location) +
  geom_tile(size = 0.5) +
  scale_fill_hue(direction = 1) +
  labs(
    x = "Date",
    y = "Credit Card Number",
    title = "Credit Card Number by Date"
  ) +
  theme_minimal()
ggplotly(ts_hm_loyalty5)

7. Exploring travel patterns using GPS Data

For this, it is important to note that the vehicles are tracked periodically as long as they are moving. The available GPS data therefore shows when the employees are on the move. Let us see who has been where!

  1. Movement of employees

Using a Gantt-style chart, we plot each employee's name concatenated with their title (y-axis) against the timestamp (x-axis). This will flesh out travel patterns over the two weeks.
#15 Looking at Movement Data

Employee_travel %>%
 filter(Timestamp >= "2014-01-05 14:28:01" & Timestamp <= "2014-01-19 04:56:55") %>%
 ggplot() +
  aes(x = Timestamp, y = NameTitle, colour = CurrentEmploymentType) +
  geom_tile(size = 1.2) +
  scale_color_hue(direction = 1) +
  labs(
    x = "Timestamp",
    y = "Name",
    title = "Movement of Employees"
  ) +
  theme_linedraw()

From this, some patterns are highlighted and discussed:

  1. Sten/President and CEO was away for a long period and was only tracked after January 13.

  2. Nils/IT Helpdesk, Loreto/Site Control and Isia/Perimeter Control travelled at odd timings, unlike the rest of the employees. Their times of travel and locations should be investigated further.

  3. Birgitta/Geologist and Minke/Perimeter Control have a similar gap in their travel patterns.

  4. Kare Orilla/Drill Technician, Hideki/Site Control and Brand/Drill Technician have unusually large gaps before the day of the disappearance. Similar patterns were exhibited earlier.

  1. Investigating Time lag

The following code chunk is used to understand the time difference between consecutive GPS pings. Setting the threshold at 10 minutes gives an indication of whether the vehicle stopped and visited a location. While the location names are not tagged, by plotting the employees against time and using the time lag, their behaviour patterns can be understood and investigated further.

#16 Lets use lag method to investigate

gps_diff <- gps %>%
  arrange(id, Timestamp, lat, long) %>%
  group_by(id) %>%
  mutate(diff = Timestamp - lag(Timestamp),
         diff_mins = (as.numeric(diff, units = 'secs'))/60) %>%
  filter(diff_mins > 10)
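As a sanity check of the lag logic above, the same idea can be run on a handful of made-up timestamps in base R: any gap between consecutive pings longer than 10 minutes is flagged as a likely stop.

```r
# Toy check of the lag logic (timestamps are invented): gaps between
# consecutive GPS pings longer than 10 minutes suggest the vehicle stopped.
ts <- as.POSIXct(c("2014-01-06 07:00:00", "2014-01-06 07:00:05",
                   "2014-01-06 07:25:00", "2014-01-06 07:25:30"),
                 tz = "UTC")

# Differences between consecutive pings, converted to minutes
diff_mins <- as.numeric(diff(ts), units = "mins")

# Indices of gaps exceeding the 10-minute threshold
stops <- which(diff_mins > 10)
stops
```

Here only the roughly 25-minute gap between the second and third pings is flagged, mirroring what `filter(diff_mins > 10)` does on the real `gps` table.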




Employee_travel2 <- merge(gps_diff, car_assign_clean, by.x="id", by.y="CarID")
names(Employee_travel2)[names(Employee_travel2) == "id"] <- "CarID"
glimpse(Employee_travel2)
Rows: 2,728
Columns: 11
$ CarID                  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ Timestamp              <dttm> 2014-01-14 19:26:01, 2014-01-14 13:2~
$ lat                    <dbl> 36.06645, 36.06582, 36.04803, 36.0664~
$ long                   <dbl> 24.88264, 24.85247, 24.87957, 24.8826~
$ diff                   <drtn> 5998 secs, 3995 secs, 15095 secs, 40~
$ diff_mins              <dbl> 99.96667, 66.58333, 251.58333, 674.00~
$ LastName               <chr> "Calixto", "Calixto", "Calixto", "Cal~
$ FirstName              <chr> "Nils", "Nils", "Nils", "Nils", "Nils~
$ Name                   <chr> "Nils Calixto", "Nils Calixto", "Nils~
$ CurrentEmploymentType  <chr> "Information Technology", "Informatio~
$ CurrentEmploymentTitle <chr> "IT Helpdesk", "IT Helpdesk", "IT Hel~
Employee_travel2$NameTitle <- paste(Employee_travel2$Name, Employee_travel2$CurrentEmploymentTitle)


time_lag_plot <- Employee_travel2 %>%
 filter(Timestamp >= "2014-01-05 14:53:01" & Timestamp <= "2014-01-19 04:56:01") %>%
 ggplot() +
  aes(x = Timestamp, y = NameTitle, fill= diff_mins, colour = diff_mins) +
  geom_tile(size = 1.2) +
  scale_fill_distiller(palette = "Set1", direction = 1) +
  scale_color_distiller(palette = "Set1", direction = 1) +
  labs(
    x = "Timestamp",
    y = "Employee Name and Title",
    title = "Time Lag - Difference of > 10mins using GPS data"
  ) +
  theme_minimal()



ggplotly(time_lag_plot)

The image below highlights some time lags that do not conform to the majority of the employees movements.

Some time lag patterns were spotted across the following employees.

  1. Alda SVP/CIO, Felix/Engineer and Israde/ Drill Technician and Elsa/Drill Technician.

  2. Isla/Perimeter Control and Israde/Drill Technician.

Aside from these, the following patterns could be investigated as they stick out from the majority of the low time lag patterns.

  1. Nils/IT Helpdesk

  2. Hernie/Perimeter Control

  3. Felix/Group Manager

  4. Bertrand/Facilities Group Manager

  5. Aldra/Badging Office

  1. Exploring travel patterns on the map!

The code chunk for plotting GPS data on the map is as follows. In the code chunk, CarID is filtered individually to zoom in and explore each car's GPS data. Let's see what suspicious activities we can spot, starting with id == 1 and running through each CarID, looking specifically for erratic travel patterns and irregularities.

#17. Map Plotting starts here
bgmap <- raster("data/Geospatial/MC2-tourist.tif")
bgmap
class      : RasterLayer 
band       : 1  (of  3  bands)
dimensions : 1595, 2706, 4316070  (nrow, ncol, ncell)
resolution : 3.16216e-05, 3.16216e-05  (x, y)
extent     : 24.82419, 24.90976, 36.04499, 36.09543  (xmin, xmax, ymin, ymax)
crs        : +proj=longlat +datum=WGS84 +no_defs 
source     : MC2-tourist.tif 
names      : MC2.tourist 
values     : 0, 255  (min, max)
#18
tmap_mode("plot") #there is a plot mode and interactive mode. 
tmap_mode("view") #interactive version

tm_shape(bgmap)+
tm_rgb(bgmap, r=1, g=2, b=3,
        alpha = NA,
        saturation = 1,
        interpolate = TRUE,
        max.value = 255)
#19 bring in shapefiles - .shp, .shx, .dbf etc 

Abila_st <- st_read(dsn = "data/Geospatial",
                    layer = "Abila" )
Reading layer `Abila' from data source 
  `C:\manmi1singh\Project\_posts\2021-07-16-third-time-a-charm\data\Geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 3290 features and 9 fields
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: 24.82401 ymin: 36.04502 xmax: 24.90997 ymax: 36.09492
Geodetic CRS:  WGS 84
#20 convert aspatial data into a simple feature, st_as_sf converts gps into a simple feature
gps_sf <- st_as_sf(gps_clean,
                   coords = c("long", "lat"),
                   crs = 4326)
#21 convert line to path and group by the ID

gps_path <- gps_sf %>%
  group_by(id) %>%
  summarize(m = mean(Timestamp), # the mean itself is not needed; summarize() is required so group_by() collapses each id
            do_union = FALSE) %>%
  st_cast("LINESTRING") # st_cast links all the GPS points of each id into a line string
#22 do the mapping
gps_path_selected <- gps_path %>%
  filter(id==35)
tmap_mode("view")
tm_shape(bgmap) +
  tm_rgb(bgmap, r=1, g=2,b=3,
         alpha =NA,
         saturation = 1,
         interpolate = TRUE,
         max.value=255)+
  tm_shape(gps_path_selected)+
  tm_lines()
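Before plotting every car one by one, a quick numeric screen can shortlist candidates. The base-R sketch below (with invented coordinates, not the real `gps` file) ranks cars by the spread of their GPS points; a large latitude/longitude bounding box hints at a vehicle that roamed unusually far and is worth mapping first.

```r
# Toy GPS points for two hypothetical cars (coordinates are invented)
gps_toy <- data.frame(
  id   = c(1, 1, 1, 35, 35, 35),
  lat  = c(36.055, 36.056, 36.057, 36.048, 36.066, 36.090),
  long = c(24.870, 24.871, 24.872, 24.830, 24.882, 24.905)
)

# Per-car extent of latitude and longitude (range of each coordinate)
spread <- aggregate(cbind(lat, long) ~ id, data = gps_toy,
                    FUN = function(x) diff(range(x)))

# Crude "roaming" score: sum of the lat and long extents per car
spread$extent <- spread$lat + spread$long
spread[order(-spread$extent), ]
```

With these toy points, car 35 roams far more widely than car 1, so it would be pulled up first in the `gps_path_selected` map above.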